Measures of Significance in HEP and Astrophysics
I compare and discuss critically several measures of statistical significance in common use in
astrophysics and in high energy physics. I also exhibit some relationships among them. 



I. INTRODUCTION 

Significance testing for a possible signal in counting 
experiments centers on the probability that an ob- 
served count in a signal region, or one more extreme, 
could have been produced solely by fluctuations of the 
background source(s) in that region. Statisticians re- 
fer to this probability as a p-value. The traditions 
for calculating signal significance differ between High 
Energy Physics (HEP) and High Energy Gamma Ray 
Astrophysics (GRA). Both fields often quote significances
in terms of equivalent standard deviations of
the normal distribution (statisticians sometimes refer
to this as a Z-value).

I will present several of the commonly used meth- 
ods in HEP and GRA, apply them to examples from 
the literature, then discuss the results. Here I will 
concentrate on observed significance, the significance 
of a particular observation, rather than predictions of 
significance for a given technique as a function of ex- 
posure. The prediction problem is slightly different, 
involving the power of the test, or the probability of 
making an observation at a given significance level. 

GRA has emphasized simple, quickly-evaluated an- 
alytical formulae for calculating Z directly (choos- 
ing asymptotically normal variables), while HEP has 
typically calculated probabilities (p-values) and then 
translated into a Z-value by 

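$$p = P(s \ge \text{observed} \mid \text{assume only background}); \qquad Z = \Phi^{-1}(1 - p),$$
where $\Phi$ is the cumulative distribution function of the
standard normal. This relation can be written [1] for large Z > 1.5 as

$$Z \approx \sqrt{u - \mathrm{Ln}\, u}; \qquad u = -2\,\mathrm{Ln}\!\left(p\sqrt{2\pi}\right),$$
giving a rough dependence of $Z \sim \sqrt{-\mathrm{Ln}\, p}$. While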
more general than the search for a simple formula for 
Z-values, the HEP approach loses track of the analytic 
structure of the problem. 
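
The conversion between a one-sided p-value and a Z-value, and the large-Z approximation quoted above, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `z_from_p`, `p_from_z`, and `z_approx` are hypothetical and not taken from the paper.

```python
import math
from statistics import NormalDist

def z_from_p(p):
    """One-sided Z-value, Z = Phi^{-1}(1 - p).

    Evaluated as -Phi^{-1}(p), which is equivalent by symmetry and
    avoids the loss of precision in 1 - p for very small p.
    """
    return -NormalDist().inv_cdf(p)

def p_from_z(z):
    """One-sided upper tail probability of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def z_approx(p):
    """Large-Z approximation: Z ~ sqrt(u - ln u), u = -2 ln(p sqrt(2 pi))."""
    u = -2.0 * math.log(p * math.sqrt(2.0 * math.pi))
    return math.sqrt(u - math.log(u))
```

For p corresponding to Z = 3 or Z = 5 the approximation should agree with the exact inversion to roughly a percent or better; it degrades as Z decreases toward 1.5.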

Observations in GRA typically consist of a count 
of gamma rays when pointing directly at a potential 
source, called an on-source count, $N_{on}$. The analogous
quantity in HEP is the number of counts in a signal region.
The background relevant to an observation of a
source is typically estimated in GRA by an off-source
observation. The relative exposure of the two observations
is denoted by $\alpha = T_{on}/T_{off}$, often less than
unity. Then the estimate of the mean background count is
$b = \alpha N_{off}$, its (Poisson) uncertainty is
$\delta b = \alpha\sqrt{N_{off}}$, and thus one derives

$$\alpha = (\delta b)^2 / b. \qquad (1)$$

GRA expressions are couched in terms of $\alpha$. I will also
use $x = N_{on}$, $y = N_{off}$, $k = x + y$ for compactness.

In HEP, sometimes a side-band method of back- 
ground estimation is used, rather like in a GRA mea- 
surement; or b may be estimated as a sum of contri- 
butions from Monte Carlo and data-based side-band 
estimates, so that often $b \pm \delta b$ is quoted, where $\delta b$ is
derived from adding uncertainties in quadrature. One
can use Eq. 1 to define $\alpha$ when comparing HEP results
with GRA expressions. Non-integer values for
effective $N_{off}$ result, but usually cause no problems.
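
As a small worked illustration of this mapping, the sketch below turns a quoted HEP background $b \pm \delta b$ into the equivalent GRA quantities. The function name `effective_alpha_noff` is hypothetical, and it assumes the whole of $\delta b$ can be treated as if it came from a single Poisson side-band count.

```python
def effective_alpha_noff(b, db):
    """Map a background estimate b +/- db onto (alpha, effective N_off).

    Uses alpha = db^2 / b (Eq. 1) and b = alpha * N_off.
    The returned N_off is generally not an integer.
    """
    alpha = db * db / b
    n_off = b / alpha
    return alpha, n_off
```

For example, b = 2.0 with δb = 0.5 corresponds to α = 0.125 and an effective off-source count of 16.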



II. Z-VALUE VARIABLES 

Many expressions for Z are of the form of a ratio of
an estimate of the signal to its standard deviation, where the signal is
estimated by $s = N_{on} - b = x - \alpha y$. Then $Z = s/\sqrt{V}$,
where V is a variance estimate for s. A standard
GRA reference gives as an example (their Equation 5)
$V_5 = N_{on} + \alpha^2 N_{off}$. The authors note that
this expression treats $N_{on}$ and $N_{off}$ as independent;
this does not consistently calculate V under the null
hypothesis, $\mu_{on} = \alpha\mu_{off}$, and in fact biases against
signals for $\alpha < 1$ by overestimating V. I have derived
a related formula, $V_{5'} = \alpha(1 + \alpha)N_{off}$, by using
only the background to estimate the mean and variance:
while not optimal, it at least is consistent with
the null. They also provide $V_9 = \alpha(N_{on} + N_{off})$,
which better implements the null hypothesis. However,
their widely-used recommendation is the likelihood
ratio $L(\mu_s, \mu_b)/L(\mu_b)$,

$$Z_L = \sqrt{2}\left( x\,\mathrm{Ln}\frac{x(1+\alpha)}{k\alpha} + y\,\mathrm{Ln}\frac{y(1+\alpha)}{k} \right)^{1/2}.$$


$Z_L$ derives from the standard likelihood ratio test for a
composite hypothesis, and Wilks' Theorem, giving its 
asymptotic normal behavior. The numerator and de- 
nominator likelihoods are each separately maximized: 
one for a signal + background model, the other for a 
background-only (null) model. 
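
The variance-ratio and likelihood-ratio expressions of this section can be compared numerically with a short sketch. The function name `z_values` and the dictionary keys are hypothetical; the code assumes x + y > 0 and y > 0, and attaching the sign of s to the likelihood-ratio value is a convention chosen here, not something stated in the text.

```python
import math

def _n_log(n, ratio):
    """n * ln(ratio), with the convention 0 * ln(0) = 0."""
    return n * math.log(ratio) if n > 0 else 0.0

def z_values(x, y, alpha):
    """Return Z_5, Z_5', Z_9 and Z_L for counts x (on), y (off)."""
    s = x - alpha * y                      # signal estimate
    k = x + y
    z5 = s / math.sqrt(x + alpha**2 * y)              # V_5
    z5p = s / math.sqrt(alpha * (1.0 + alpha) * y)    # V_5'
    z9 = s / math.sqrt(alpha * k)                     # V_9
    arg = 2.0 * (_n_log(x, x * (1.0 + alpha) / (k * alpha))
                 + _n_log(y, y * (1.0 + alpha) / k))
    # Guard against tiny negative rounding; sign follows the signal estimate.
    zl = math.copysign(math.sqrt(max(arg, 0.0)), s)
    return {"Z5": z5, "Z5p": z5p, "Z9": z9, "ZL": zl}
```

For x = 20, y = 20, α = 0.5 this gives Z_5 = 2.0, Z_L ≈ 2.17, Z_9 ≈ 2.24 and Z_5' ≈ 2.58, so even in this single case the four expressions differ by a sizeable fraction of a standard deviation.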

One may instead seek an asymptotically normal
variable with nearly constant variance [?],

$$Z_0 = \frac{2}{\sqrt{1+\alpha}}\left(\sqrt{x + 3/8} - \sqrt{\alpha\,(y + 3/8)}\right).$$
The 3/8 speeds convergence to normality from the underlying
discreteness.
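
A direct transcription of the variance-stabilized variable, as a sketch (the function name `z_0` is hypothetical):

```python
import math

def z_0(x, y, alpha):
    """Variance-stabilized Z: square roots of Poisson counts have
    nearly constant variance 1/4, so the difference below has
    variance close to (1 + alpha) / 4."""
    return (2.0 / math.sqrt(1.0 + alpha)) * (
        math.sqrt(x + 0.375) - math.sqrt(alpha * (y + 0.375)))
```

For x = 20, y = 20, α = 0.5 this evaluates to about 2.16.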
A. Other Frequentist Methods

One widely used form is $Z_{sb} = s/\sqrt{b}$ (sometimes [4]
called the "signal to noise ratio"). This entirely ignores
the uncertainty in the background estimate. It
is often used for optimizing selection criteria, because
of its simplicity. Slightly better is a $Z_P$ calculated
from the Poisson probability p-value:

$$p_P = P(\ge x \mid b) = \sum_{j=x}^{\infty} \frac{e^{-b}\, b^j}{j!} = \frac{\Gamma(x, 0, b)}{\Gamma(x)},$$

here written in terms of an incomplete $\Gamma$ function.
$Z_P$ still ignores uncertainty in b. Occasionally one
sees substitutions of $b \to b + \delta b$ as a feeble attempt to
incorporate the uncertainty in b.
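
The two known-background measures can be sketched as below. The names `poisson_tail`, `z_sb`, and `z_p` are hypothetical. The tail is summed directly rather than computed as one minus the lower sum, so that very small p-values keep their precision; the sketch assumes an integer x and a moderate b, and does not handle underflow of the first term for very large b.

```python
import math
from statistics import NormalDist

def poisson_tail(x, b):
    """p_P = P(N >= x | mean b), summed over the upper tail."""
    if x <= 0:
        return 1.0
    if b <= 0:
        return 0.0
    term = math.exp(-b + x * math.log(b) - math.lgamma(x + 1))
    total, j = 0.0, x
    while True:
        total += term
        ratio = b / (j + 1)          # next term / current term
        # Past the peak the remainder is bounded by a geometric series.
        if ratio < 1.0 and term * ratio / (1.0 - ratio) <= 1e-17 * total:
            break
        term *= ratio
        j += 1
    return min(total, 1.0)

def z_sb(x, b):
    """Signal-to-noise ratio s / sqrt(b), with s = x - b."""
    return (x - b) / math.sqrt(b)

def z_p(x, b):
    """Z-value equivalent of the Poisson tail probability (0 < p < 1)."""
    return -NormalDist().inv_cdf(poisson_tail(x, b))
```

With x = 10 and b = 4, `z_sb` returns 3.0 while `z_p` is noticeably smaller, which illustrates how optimistic s/√b is at low counts even before any background uncertainty is considered.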

Finally, one may view a significance calculation directly
as a p-value calculation which one could use
as a test of the null hypothesis. $Z_L$ uses the standard
(non-optimal) test of a composite hypothesis against a
null. However, the relationship of the Poisson means,
whether $\mu_{on} > \alpha\mu_{off}$, is a special case of a composite
hypothesis test that admits a more optimal solution.
There exists a Uniformly Most Powerful test among
the class of Unbiased tests (UMPU) for this case, in the form
of a binomial proportion test for the ratio of the two
Poisson means [?]. The UMPU properties are, strictly
speaking, derived only with an assumption of randomization,
that is, hiding the underlying discreteness by
adding a random number to the data. This test yields
a binomial probability p-value (using k = x + y):

$$p_{Bi} = P_{Bi}(\ge x \mid w, k) = \sum_{j=x}^{k} \frac{k!}{j!\,(k-j)!}\, w^j (1-w)^{k-j},$$

where $w = \alpha/(1 + \alpha)$ is the expected ratio of the
Poisson means for x and x + y. After some manipulation,
this can be written in terms of incomplete and
complete beta functions [?, ?], which is convenient for
numerical evaluation:

$$p_{Bi} = B(w, x, 1 + y)/B(x, 1 + y).$$
This test is conditional on x + y fixed because of the 
existence of a nuisance parameter: there are two Poisson
means, but the quantity of interest is their ratio.
While this test is known to both the GRA [?] and
HEP [7] communities, it is common practice in neither,
and its optimality properties are not common knowl- 
edge. 
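
A minimal sketch of the binomial test follows, using the finite-sum form of the p-value evaluated in log space. The names `binomial_tail` and `z_bi` are hypothetical, and the sketch assumes integer x and y; for the non-integer effective $N_{off}$ that arises from a quoted $b \pm \delta b$, the incomplete beta function form would be needed instead.

```python
import math
from statistics import NormalDist

def binomial_tail(x, k, w):
    """P(X >= x) for X ~ Binomial(k, w), with 0 < w < 1."""
    if x <= 0:
        return 1.0
    if x > k:
        return 0.0
    log_w, log_1mw = math.log(w), math.log1p(-w)
    total = 0.0
    for j in range(x, k + 1):
        log_term = (math.lgamma(k + 1) - math.lgamma(j + 1)
                    - math.lgamma(k - j + 1)
                    + j * log_w + (k - j) * log_1mw)
        total += math.exp(log_term)
    return min(total, 1.0)

def p_bi(x, y, alpha):
    """Binomial p-value, conditional on k = x + y."""
    return binomial_tail(x, x + y, alpha / (1.0 + alpha))

def z_bi(x, y, alpha):
    """Z-value equivalent of p_bi (requires 0 < p < 1)."""
    return -NormalDist().inv_cdf(p_bi(x, y, alpha))
```

As a sanity check, x = 2, y = 1, α = 1 gives w = 1/2 and k = 3, so the p-value is (3 + 1)/8 = 0.5 and the Z-value is zero.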

Given the (restricted) optimality of the test, and 
the lack of a UMP test for this class of composite 
hypotheses, this test ought to be more frequently used 
to calculate significance, even though it is clearly a 
longer calculation than $Z_L$. For moderate x, y, closed
forms in terms of special functions are available, while
some care is required for larger n. For $Z_{Bi} < 3$, the
Z-values reported may be somewhat too small [?, ?],
but for typical applications one is more interested in
$Z_{Bi} > 4$.

It is interesting to note that taking a normal ap- 
proximation to the binomial test (that is, comparing 
the difference of binomial proportion from its expected 
value, to the square root of its normal-approximation 
variance) yields $(x/k - w)/\sqrt{w(1 - w)/k}$, which can
be shown to be identical to $Z_9 = s/\sqrt{V_9}$.
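
The identity follows in a few lines from the definitions $w = \alpha/(1+\alpha)$, $k = x + y$, $s = x - \alpha y$, and $V_9 = \alpha(x + y)$; the steps are written out here as a check:

```latex
\begin{align*}
\frac{x}{k} - w
  &= \frac{x(1+\alpha) - \alpha(x+y)}{k(1+\alpha)}
   = \frac{x - \alpha y}{k(1+\alpha)}
   = \frac{s}{k(1+\alpha)}, \\
\frac{w(1-w)}{k}
  &= \frac{\alpha}{(1+\alpha)^2\, k}, \\
\frac{x/k - w}{\sqrt{w(1-w)/k}}
  &= \frac{s}{k(1+\alpha)} \cdot \frac{(1+\alpha)\sqrt{k}}{\sqrt{\alpha}}
   = \frac{s}{\sqrt{\alpha k}}
   = \frac{s}{\sqrt{V_9}} .
\end{align*}
```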

A different approach attempts to move directly 
from likelihood to significance by using a 3rd-order
expansion [?]. The mathematics is interesting, combining
two first order estimates (which give significance
to order $1/\sqrt{n}$) to yield a $1/n^{3/2}$ result. Typically,
the first-order estimates are of the form of a normal
deviation, $Z_t$ (like $Z_9$), and a likelihood ratio like $Z_L$;
of these, the likelihood ratio is usually a better first-order
estimate. The two are then combined into the
third order estimate by a formula such as

$$Z_3 = Z_L + \frac{1}{Z_L}\,\mathrm{Ln}(Z_t/Z_L).$$

Generically, $Z_t = \Delta/\sqrt{V}$ is a Student t-like variable,
where $\Delta$ is the difference of the maximum likelihood
value of $\theta$ (the parameter of interest) from its value under
the null hypothesis, and V is a variance estimate
derived from the Fisher Information $\partial^2 L/\partial\theta^2$. The
attraction of the method is to achieve simple formulae
with accurate results. However, the mathematics
becomes more complex [?] when nuisance parameters
are included, as is needed when the background is imperfectly
known. Here I will only compare the approximate
calculation for a perfectly known background to
the corresponding exact calculation, $p_P$.



III. BAYESIAN METHODS 

HEP common practice often involves Bayesian 
methods of incorporating "systematic" uncertainties 
for quantities such as efficiencies [?]. These methods
are also used for calculating significances, particularly
when the background b is a sum of several contributions,
since the method naturally extends to complex
situations where components of $\delta b$ are correlated. The
typical calculation represents the lack of knowledge of
b by a posterior density function p(b|y); it is referred
to as a posterior density because it is posterior to the
off-source measurement y. The usual way of proceeding
is to calculate Poisson p-values $p_P = P(\ge x \mid b)$ as
was done above, but this time taking into account the
uncertainty in b by performing an average of p-values
weighted by the Bayesian posterior p(b|y), that is

$$p_{Ba} = \int p_P(\ge x \mid b)\; p(b \mid y)\; db.$$
This can be evaluated by Monte Carlo integration, or
by a mixture of analytical and numerical methods. I
will pursue the latter course here. The most common
usage in HEP is to represent p(b|y) as a truncated
normal distribution

$$p_N(b \mid y) = \frac{1}{\sqrt{2\pi\,\delta b^2}}\, \exp\!\left(-\frac{(b - \alpha y)^2}{2\,\delta b^2}\right), \qquad b > 0.$$
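
The average of Poisson p-values over a truncated normal posterior can be sketched with a simple trapezoidal integration. Everything here is illustrative: the names `poisson_tail` and `p_bayes_normal`, the ±8 δb integration range, and the step count are assumptions, and the truncated density is renormalized numerically over b > 0, which is a choice made in this sketch.

```python
import math

def poisson_tail(x, b):
    """P(N >= x | mean b), summed over the upper tail (integer x)."""
    if x <= 0:
        return 1.0
    if b <= 0:
        return 0.0
    term = math.exp(-b + x * math.log(b) - math.lgamma(x + 1))
    total, j = 0.0, x
    while True:
        total += term
        ratio = b / (j + 1)
        if ratio < 1.0 and term * ratio / (1.0 - ratio) <= 1e-17 * total:
            break
        term *= ratio
        j += 1
    return min(total, 1.0)

def p_bayes_normal(x, b0, db, n_steps=4000, n_sigma=8.0):
    """Average of p_P(b) over a normal in b centered at b0 with width db,
    truncated to b > 0 and renormalized numerically (trapezoid rule)."""
    lo = max(0.0, b0 - n_sigma * db)
    hi = b0 + n_sigma * db
    h = (hi - lo) / n_steps
    num = den = 0.0
    for i in range(n_steps + 1):
        b = lo + i * h
        wgt = math.exp(-0.5 * ((b - b0) / db) ** 2)
        if i == 0 or i == n_steps:
            wgt *= 0.5               # trapezoid end-point weights
        num += wgt * poisson_tail(x, b)
        den += wgt
    return num / den
```

In the limit of small δb the result reduces to the plain Poisson p-value, and for x well above b the averaged p-value is larger, that is, the significance is reduced once the background uncertainty is included.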

If b is a sum of many contributions, its distribution 
should asymptotically approach a normal. An alternative
I have advocated in HEP [?], and which is also
known to the GRA community [13], is to start from
a flat prior for b and derive the p(b|y) in the usual
Bayesian fashion, leading to a Gamma posterior (written as a density in $\beta$):
$$p_\Gamma(\beta \mid y) = \frac{\beta^y e^{-\beta}}{y!}, \qquad \beta = b/\alpha.$$
This is most appropriate when a single contribution
to b dominates and its uncertainty is actually due to
counting statistics. I will refer to the Z-values which
result from these two choices as $Z_N$ for the normal
posterior, and $Z_\Gamma$ for the Gamma function posterior.
Choosing to represent $p_P$ as a sum, and performing
the b integration first gives the p-value for the Gamma
posterior [?]

$$p_\Gamma = \sum_{j=x}^{\infty} \frac{(y+j)!}{j!\,y!}\, \frac{\alpha^j}{(1+\alpha)^{1+y+j}}.$$




Despite appearances, $p_\Gamma$ is identical to $p_{Bi}$. The Beta
function representation of $p_{Bi}$ is much more suitable
for large values of x, y. The two expressions can be
made somewhat closer by using $w = \alpha/(1 + \alpha)$.
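
The identity between the Gamma-posterior sum and the binomial p-value can be checked numerically. The sketch below is illustrative only: the names `p_gamma` and `p_binomial` are hypothetical, integer x and y are assumed, and the infinite sum is truncated once a geometric bound on the remainder falls below the requested tolerance.

```python
import math

def p_gamma(x, y, alpha, rel_tol=1e-16, max_terms=10_000_000):
    """Sum over j >= x of (y+j)!/(j! y!) * alpha^j / (1+alpha)^(1+y+j)."""
    w = alpha / (1.0 + alpha)
    log_term = (math.lgamma(y + x + 1) - math.lgamma(x + 1)
                - math.lgamma(y + 1)
                + x * math.log(alpha) - (1 + y + x) * math.log1p(alpha))
    term, total, j = math.exp(log_term), 0.0, x
    for _ in range(max_terms):
        total += term
        ratio = w * (y + j + 1) / (j + 1)   # next term / current term
        # The ratio decreases with j, so the tail is bounded geometrically.
        if ratio < 1.0 and term * ratio / (1.0 - ratio) <= rel_tol * total:
            break
        term *= ratio
        j += 1
    return total

def p_binomial(x, y, alpha):
    """P(X >= x) for X ~ Binomial(x + y, w), w = alpha / (1 + alpha)."""
    k, w = x + y, alpha / (1.0 + alpha)
    total = 0.0
    for j in range(x, k + 1):
        total += math.exp(math.lgamma(k + 1) - math.lgamma(j + 1)
                          - math.lgamma(k - j + 1)
                          + j * math.log(w) + (k - j) * math.log1p(-w))
    return total
```

The infinite sum needs more terms as α grows, since the term ratio approaches w = α/(1 + α) from above, which is one way to see why the finite binomial (or Beta function) form is preferable for large counts.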

Bayesian practice typically focuses on direct com- 
parison of specific hypotheses through the odds ratio. 
However, predictive inference is commonly used in 
model checking (significance testing is just checking 
the background-only model). Predictive inference in 
our case is directly related to calculating p(x|y), that
is, averaging over the unknown parameter b:

$$p(j \mid y) = \int p(j \mid b)\; p(b \mid y)\; db.$$

Interestingly, some Bayesian practitioners go farther,
and are willing to calculate a "Bayesian p-value" [?]

$$p_{Bayes} = \sum_{j=x}^{\infty} p(j \mid y),$$

which is precisely the $p_{Ba}$ given above (there we
summed before integrating).



IV. COMPARISON OF RESULTS: RELATIVE 
PERFORMANCE 

I have taken several interesting test cases from the 
HEP and GRA literature. The input values and Z- 
value calculation results are shown in Table 1. For 
the HEP cases, the values reported in the papers are
$N_{on}$, b, and $\delta b$, while in the GRA case, the reported
values are $N_{on}$, $N_{off}$, and $\alpha$. I have also included a
few artificial cases in order to sample the parameter 
space reasonably. 

It is worth remarking that there are numerical issues 
to be faced in evaluation of the more complex methods.
These remarks apply, at a minimum, to a Mathematica
implementation. The Binomial is straightforward
in its Beta function representation. The Bayes
p-value methods may involve an infinite sum, and are
touchy and slow for large n; [?] suggests approximating
the summation by an integral. Fraser-Reid
and the Bayes p-value summation results may be sensitive
to whether integers or floating point values are
used. An alternative attack is to leave the $p_P$ as a $\Gamma$
function ratio and trade an integration for the infinite
sum. Doing so in the Bayes Gaussian case is less unstable
than summing, but for large n requires hints on
the location of the peak of the integrand.

For the purposes of the present section, I will take 
the Frequentist UMPU Binomial ratio test as a reference
standard, because of its optimality properties. I
will have more to say on this later.

None of these examples from the recent literature 
was published with a seriously wrong significance 
level. To me, the most striking result in the table 
is that the Bayes Gamma posterior method produces results
identical to the Binomial result (MSU graduate
student HyeongKwan Kim has proven the identity).

The method most used in HEP, Bayes with a nor- 
mal posterior for b, produces Z's always larger than 
those from Bayes Gamma. Viewing the calculation as 
averaging the Poisson p-value $p_P(b)$ over the posterior
for b, the shorter tails of the normal compared to the
gamma place less weight on the larger probabilities
(smaller Z-values) obtained when the off-source measurement
happens to underestimate the true value of
b. The difference is most striking for large values of
$\alpha$, that is, when the background estimate is performed
with less sensitivity than the signal estimate; in this
case, results differing in significance by over 0.5σ can
occur. The most common method in GRA, the simple
Log Likelihood ratio formula, produces comparable or
slightly higher estimates of significance, but seems less
vulnerable to problems at large $\alpha$. It appears to claim
the highest significance of these methods at small n.
The variance stabilization method $Z_0$ presented in [?]
does not appear to be in general use in GRA, but
produces results of similar quality to the other two
mainstay methods. All methods agree for N > 500,
where the normal approximations are good, even out
to 3-6σ tails.

The "not recommended" methods all produce results
off by more than 0.5σ for several low-statistics
cases. $Z_9$, which approximates $Z_{Bi}$, does best; $Z_5$ is
indeed biased against real signals compared to other
measures, and its alleged improvement $Z_{5'}$, while curing
that problem, overestimates significance as the
price for its less efficient use of information compared
to $Z_9$.

As expected, ignoring the uncertainty in the background
estimate leads to overestimates of the significance.
$s/\sqrt{b}$ is much more over-optimistic than an
exact Poisson calculation, particularly for small n, or
$\alpha > 1$, where the background uncertainty is most important.
The best that can be said for $s/\sqrt{b}$ is that it
is mostly monotonic in the true significance, at least as 
it is typically used (for comparing two selection crite- 
ria with N varying by an order of magnitude at most). 
The 3rd order Fraser-Reid approximation is fast and 
accurate up to moderate n, suggesting it is worth pur- 
suing the full nuisance parameter case. However, the 
approximation fails for one large Z, and is very slow 
for the largest n. 

Of the ad-hoc corrections for signal uncertainty,
none are reliable; the "corrected" Poisson calculation
is less biased than the un-corrected, but still widely
overestimates significance for $\alpha > 1$, and can't be used
for serious work. The $s/\sqrt{b + \delta b}$ isn't much better
than its "un-corrected" version.
FIG. 1: Contours of equal Z, case [?], for $Z_{sb}$ (left) and
$Z_L$ (right).



To summarize, most bad formulae overestimate significance
(the only exceptions are $Z_5$ for $\alpha < 1$ and
Poisson with $b \to b + \delta b$). Thus, prudence demands
using a formula with good properties. The Binomial
test seems best for simple Poisson backgrounds. For
backgrounds with several components, compare Bayes
MC with $\Gamma$ or Normal posteriors.

V. CALIBRATION OF ABSOLUTE 
SIGNIFICANCE: MONTE CARLO 

In the previous section, results of significance cal- 
culations were compared to a reference calculation, 
the UMPU Binomial Test. That method produces 
the lowest reported significance among the methods 
with a sound theoretical basis. This alone could justify
its use (on grounds of conservatism), but would
beg the question of whether the Binomial test is actually
"correct." This has been studied by Monte Carlo
simulation in [?].

A few observations on MC testing are useful. One 
might imagine simply generating instances of Poisson
variables x, y with means $\mu$, $\mu/\alpha$, and calculating
$Z_{MC}$ from $p_{MC}$ = the fraction of events "more signal-like"
than ($N_{on}$, $N_{off}$). Instead, in [?, 13] a separate MC
is done for each individual measure, because there is
no unique "correct" Z-value for a given observation.
The best that can be done is to ask that a method
produce a Z value consistent with MC probabilities
when the observation is analyzed by that method.
The problem is that there is no unique definition of
"more signal-like". One is essentially trying to find a
unique ordering of points on the x, y plane to define
those which are similarly far from the observed point
($N_{on}$, $N_{off}$).

Each variable introduces its own metric, and con- 
tours of equal Z do not coincide for different Z vari- 
ables, as seen in Figure 1. 

The p-value for an observation $(x_0, y_0)$ depends on
these contours:

$$p_{MC}(x_0, y_0) = \int_{Z > Z_0} p(x, y)\; dx\, dy,$$
where the integration is over the region beyond the
contour line $Z_0$ passing through the observation:
$Z(x, y) > Z_0(x_0, y_0)$.
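
A per-measure Monte Carlo of this kind can be sketched as follows. The ordering statistic is passed in as a function, so each Z variable defines its own "more signal-like" region. All names (`z9`, `poisson_sample`, `p_mc`) are hypothetical, the true off-source mean `mu_off` is an assumed input (it is a nuisance parameter, and how the cited studies chose it is not stated here), and Knuth's multiplication sampler is used only because it needs nothing beyond the standard library.

```python
import math
import random

def z9(x, y, alpha):
    """Example ordering statistic: s / sqrt(alpha * (x + y))."""
    k = x + y
    return (x - alpha * y) / math.sqrt(alpha * k) if k > 0 else 0.0

def poisson_sample(rng, mean):
    """Knuth's multiplication method; adequate for modest means."""
    limit = math.exp(-mean)
    n, prod = 0, rng.random()
    while prod > limit:
        n += 1
        prod *= rng.random()
    return n

def p_mc(x0, y0, alpha, mu_off, z_func=z9, n_trials=20000, seed=1):
    """Fraction of background-only pseudo-experiments whose Z, computed
    with z_func, is at least as large as that of the observation."""
    rng = random.Random(seed)
    z_obs = z_func(x0, y0, alpha)
    hits = 0
    for _ in range(n_trials):
        x = poisson_sample(rng, alpha * mu_off)   # on-source, null mean
        y = poisson_sample(rng, mu_off)           # off-source
        if z_func(x, y, alpha) >= z_obs:
            hits += 1
    return hits / n_trials
```

Swapping `z_func` for a different Z variable changes the region being counted, which is the point made in the text: the Monte Carlo p-value is defined relative to the contours of the measure under study.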

For small n, the contours are markedly different, 
so that two different Z-values could both be correct if 
each agreed with their respective $Z_{MC}$. Still, the situation
is not catastrophic, as values of Z are not wildly
different, and presumably the $Z_{MC}$ differ somewhat
less than the reported values in Table 1. For larger
n, the contours become straighter and more similar, 
and more importantly, the probability becomes more 
peaked, so that a smaller region contributes. Thus, 
the central limit forces convergence to a unique Z 
value for large n. 

Although Monte Carlo studies can never explore the
entire parameter space, the general conclusion of [?]
is that $Z_{Bi}$ is the best of the alternatives. $Z_{Bi}$ is
only slightly conservative for Z > 3. There, $p_{Bi}$ is a
bit larger than $p_{MC}$ and thus $Z_{Bi} < Z_{MC}$ by 3% or
less on the Z scale when $\min(N_{on}, N_{off}) < 20$, and
$Z_{Bi}$ performs even better for larger n. They found
the deviations of other methods from $Z_{MC}$ are typically
larger. They also cite work [?] which finds larger
fractional deviations [25] for $Z_{Bi}$ for smaller Z. Since
Z > 3 is the lower edge of the region where claims
are liable to be made, and the degree of conservatism
is small, this would also justify accepting $Z_{Bi}$ as the
reference standard, and as the recommended method
of evaluating significance when there is any concern
about the validity of other methods, at least when a
single counting uncertainty dominates the knowledge
of the background.